Load the required libraries:
library(tidyverse)
library(ggplot2)
library(Matrix)
library(xgboost)
library(pROC)
source("SMAP13b - helpers.R")
We use a simulated data set of 711,599 persons aged 18+ that is representative of Amsterdam. The data frame contains the neighbourhood (bu_code), demographic features (age, sex, …), and the answer to the health monitor survey question “Drank alcohol in the past 12 months” (y = 0, 1) for every individual. Because the data are simulated, the true probability of drinking alcohol (p) is also known for every person:
# Load data set
data <- read_csv("SMAP_simulated.zip", show_col_types=F) %>%
mutate_if(is.character, as.factor)
data
## # A tibble: 711,599 × 17
## bu_code age hhsize hhincome hhassets oad sex ethnicity marital_status
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct>
## 1 BU036300… 53 1 13 1 99 woman other_no… single
## 2 BU036300… 33 2 78 84 99 man netherla… single
## 3 BU036300… 48 1 9 31 99 man other_we… single
## 4 BU036300… 34 2 29 45 99 man other_we… married
## 5 BU036300… 33 2 32 17 99 woman netherla… single
## 6 BU036300… 35 2 54 30 99 man netherla… single
## 7 BU036300… 45 2 73 35 99 woman netherla… married
## 8 BU036300… 47 1 49 95 99 woman other_no… single
## 9 BU036300… 27 2 23 1 99 woman netherla… single
## 10 BU036300… 40 2 86 100 99 man netherla… single
## # … with 711,589 more rows, and 8 more variables: education <fct>,
## # hhtype <fct>, hhhomeownership <fct>, hhincomesource <fct>, X <dbl>,
## # Y <dbl>, p <dbl>, y <dbl>
Small Area Estimation (SAE) seeks to predict an outcome in areas where only a few people were sampled.
Our goal is to estimate how many people drink alcohol (y) in each area (bu_code) based on the survey. First, calculate the true prevalence and population size in each area:
# Calculate prevalence in each bu_code
bu_code.prevalence <- data %>% group_by(bu_code) %>% summarize(y=mean(y), n=n())
bu_code.prevalence
## # A tibble: 445 × 3
## bu_code y n
## <fct> <dbl> <int>
## 1 BU03630000 0.868 1028
## 2 BU03630001 0.889 665
## 3 BU03630002 0.839 1518
## 4 BU03630003 0.894 339
## 5 BU03630004 0.865 645
## 6 BU03630101 0.905 527
## 7 BU03630102 0.870 416
## 8 BU03630103 0.867 767
## 9 BU03630104 0.865 728
## 10 BU03630105 0.879 692
## # … with 435 more rows
In reality, the problem is that very few people may answer the survey in a given area, so taking the mean of their responses is very unreliable. But what if we could predict the missing survey responses for the entire population? Then we would have much more data in every area and could make estimates like this…
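To see why a small-area mean is unreliable, recall that the standard error of a sample proportion is sqrt(p(1 − p)/n). A quick back-of-the-envelope check (the prevalence 0.87 and respondent counts below are illustrative only):

```r
# Approximate 95% confidence-interval half-width for a proportion p
# estimated from n respondents
se <- function(p, n) sqrt(p * (1 - p) / n)
round(1.96 * se(0.87, c(10, 50, 500)), 3)
# roughly +/- 0.21 with 10 respondents, but only +/- 0.03 with 500
```

With a handful of respondents the direct estimate can easily be off by twenty percentage points, which is why model-based predictions for the whole population are attractive.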
Let’s plot this data. In Small Area Estimation we try to make more accurate estimates of this map:
# Load an sf geometry polygon (geometry) for each neighbourhood (bu_code)
bu_code.geometry <- laad_geodata_wfs(region="buurt", year=2020)
colnames(bu_code.geometry) <- c("bu_code", "name", "geometry")
# Add geometry information and calculate quantiles
bu_code.prevalence <- bu_code.prevalence %>%
merge(bu_code.geometry) %>%
mutate(prevalence=cut(y, quantile(y, probs=seq(0,1,0.1)), include.lowest=TRUE))
# Plot
p1 <- ggplot(bu_code.prevalence) +
geom_sf(aes(geometry=geometry, fill=prevalence), color=NA) +
scale_fill_brewer(palette="RdYlGn", direction=-1) +
labs(title="Small Area Estimates") +
theme(legend.position="bottom")
p1
We now investigate the performance if we had the entire population of Amsterdam to train with. You can think of this as “ask people in Amsterdam, then generalize to the entire Netherlands”. This is not too far from the scenario where around 4% of the population of the Netherlands answered the survey and we predicted the prevalence in all neighbourhoods of the Netherlands.
Advanced knowledge: is this an accurate estimate of generalization ability?
How do we evaluate how well the model predicts?
One option is to split the observed survey responses data set into a “training” and “test” set. We know the true answers in the “test” set. The model is fitted to the “training” set and the predicted answers in the “test” set are compared to the true answers. It is essential that we compare the model predictions to outcomes that the model has not seen! Many machine learning methods are so flexible that they could fit the training data perfectly, but this might not generalize well outside that training set.
Split Amsterdam data into train & test (75% in train)
set.seed(42)
train.idx <- sample.int(n=nrow(data), size=round(0.75*nrow(data)), replace=F)
train <- (1:nrow(data) %in% train.idx)
test <- !(1:nrow(data) %in% train.idx)
# How many rows?
print(paste(sum(train), sum(test), sep=" / "))
## [1] "533699 / 177900"
We use the AUC to compare the models in this example. Because we know the probabilities used to generate the data, we can also define a ‘perfect’ model:
calculate_auc <- function (y_true, y_pred) auc(roc(y_true, y_pred, levels=c(0, 1), direction = '<', quiet=T))
# These are the true answers we seek to predict
true_values <- data[test,]$y
true_probability <- data[test,]$p
# This is the best possible performance, with real data there is no way to know this
print(calculate_auc(true_values, true_probability))
## Area under the curve: 0.8054
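As a sanity check on what the AUC measures: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counted as half). Using the vectors defined above, we can estimate this directly by sampling pairs — a rough sketch whose estimate should land near the pROC value:

```r
# Estimate the AUC as the probability that a random case (y = 1) gets a
# higher true probability than a random control (y = 0)
set.seed(1)
cases <- true_probability[true_values == 1]
controls <- true_probability[true_values == 0]
i <- sample(length(cases), 1e5, replace=TRUE)
j <- sample(length(controls), 1e5, replace=TRUE)
mean(cases[i] > controls[j]) + 0.5 * mean(cases[i] == controls[j])
```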
Let’s start with the SMAP model. This is the ‘true model’ that was used to simulate the data. To train the model we must give an explicit model specification:
# Train
ggd.bu_codes <- list(GM0363=unique(data$bu_code))
model <- smapmodel(ggd.bu_codes, bu_code.geometry)
## [1] "GM0363"
formula = y ~
s(age, by = sex, bs = "ps", k = 10) +
s(age, by = ethnicity, bs = "ps", k = 10) +
s(age, by = marital_status, bs = "ps", k = 10) +
s(age, by = education, bs = "ps", k = 10) +
s(sex, ethnicity, bs = "re") +
s(sex, marital_status, bs = "re") +
s(sex, education, bs = "re") +
s(hhtype, bs = "re") +
s(hhsize, bs = "ps", k = 5) +
s(hhincomesource, bs = "re") +
s(hhhomeownership, bs = "re") +
s(hhincome, bs = "ps", k = 10) +
s(hhassets, bs = "ps", k = 10) +
s(oad, bs = "ps", k = 10)
model <- fit.smapmodel(model, data[train,], formula)
# Predict
predictions <- predict.smapmodel(model, data[test,])
# The true model does very well
print(calculate_auc(true_values, predictions))
## Area under the curve: 0.8041
It is very easy to train the XGBoost model and predict with it:
# Define a matrix of features as input and a vector of labels as output:
formula <- y ~ age + sex + ethnicity + marital_status + education +
hhtype + hhsize + hhhomeownership + hhincomesource + hhincome + hhassets +
oad + X + Y
X <- sparse.model.matrix(formula, data=data)
y <- data$y
# Train
model <- xgboost(objective="binary:logistic", eval_metric="logloss", data=X[train,], label=y[train], nrounds=50, verbose=0)
# Predict
predictions <- predict(model, X[test,])
# XGBoost is pretty good
print(calculate_auc(true_values, predictions))
## Area under the curve: 0.8014
Fit a simple logistic regression model and predict the unknown outcomes:
# Train
model <- glm(formula, data = data[train,], family = "binomial")
# Predict
predictions <- predict(model, newdata = data[test,], type = "response")
# Simple logistic regression works quite well..
print(calculate_auc(true_values, predictions))
## Area under the curve: 0.7972
What do you think, are these differences significant?
Why did we set the number of trees to 50? A lucky guess ;). But let’s investigate more formally.
With XGBoost, training the model means adding one more tree at each iteration, and the number of trees is its most important ‘hyperparameter’, at least in most problems. We use 5-fold cross-validation on the training set and, for each number of trees 1, 2, 3, …, 300, store the train and test AUC. Luckily XGBoost has a ready-made interface for this (we could also program it manually). Save the train & test AUC as a function of the number of trees (nrounds):
set.seed(42)
model <- xgb.cv(list(objective="binary:logistic", eval_metric="auc"), data=X[train,], label=y[train],
nrounds=300, verbose=1, nfold=5, stratified=T, print_every_n=10)
## [1] train-auc:0.769101+0.000843 test-auc:0.768030+0.002260
## [11] train-auc:0.799979+0.000469 test-auc:0.797023+0.001744
## [21] train-auc:0.806135+0.000378 test-auc:0.800846+0.001649
## [31] train-auc:0.809188+0.000394 test-auc:0.801637+0.001488
## [41] train-auc:0.811520+0.000480 test-auc:0.801771+0.001427
## [51] train-auc:0.813549+0.000413 test-auc:0.801687+0.001429
## [61] train-auc:0.815411+0.000531 test-auc:0.801562+0.001430
## [71] train-auc:0.817309+0.000478 test-auc:0.801395+0.001387
## [81] train-auc:0.818899+0.000541 test-auc:0.801212+0.001315
## [91] train-auc:0.820527+0.000516 test-auc:0.801017+0.001332
## [101] train-auc:0.822116+0.000601 test-auc:0.800733+0.001416
## [111] train-auc:0.823793+0.000606 test-auc:0.800524+0.001472
## [121] train-auc:0.825282+0.000514 test-auc:0.800273+0.001450
## [131] train-auc:0.826789+0.000549 test-auc:0.799933+0.001423
## [141] train-auc:0.828247+0.000578 test-auc:0.799680+0.001488
## [151] train-auc:0.829566+0.000576 test-auc:0.799425+0.001488
## [161] train-auc:0.830977+0.000557 test-auc:0.799131+0.001456
## [171] train-auc:0.832322+0.000494 test-auc:0.798860+0.001417
## [181] train-auc:0.833579+0.000513 test-auc:0.798577+0.001434
## [191] train-auc:0.834773+0.000501 test-auc:0.798356+0.001418
## [201] train-auc:0.835926+0.000558 test-auc:0.798062+0.001393
## [211] train-auc:0.837188+0.000443 test-auc:0.797810+0.001431
## [221] train-auc:0.838266+0.000484 test-auc:0.797536+0.001465
## [231] train-auc:0.839389+0.000460 test-auc:0.797225+0.001498
## [241] train-auc:0.840540+0.000518 test-auc:0.796870+0.001472
## [251] train-auc:0.841756+0.000468 test-auc:0.796586+0.001487
## [261] train-auc:0.842980+0.000523 test-auc:0.796350+0.001494
## [271] train-auc:0.844128+0.000510 test-auc:0.796073+0.001472
## [281] train-auc:0.845262+0.000519 test-auc:0.795845+0.001444
## [291] train-auc:0.846402+0.000500 test-auc:0.795592+0.001452
## [300] train-auc:0.847418+0.000587 test-auc:0.795344+0.001494
Overfitting seems to occur after about 50 trees. If we fit more trees, training performance keeps improving but the test performance decreases.
# Calculate max train and test AUC at different number of trees
evaluation.max <- model$evaluation_log %>%
gather(variable, value, c(train_auc_mean, test_auc_mean)) %>%
group_by(iter, variable) %>%
summarize(value=max(value))
## `summarise()` has grouped output by 'iter'. You can override using the
## `.groups` argument.
# Plot the results
ggplot(evaluation.max, aes(x=iter, y=value, group=variable, color=variable)) + geom_line() +
labs(title="XGBoost hyperparameters: number of trees", x="Iteration (nrounds)", y="AUC")
What is the optimal number of trees?
opt.params <- evaluation.max %>% ungroup %>% filter(variable=="test_auc_mean") %>% top_n(n=1)
## Selecting by value
opt.params
## # A tibble: 1 × 3
## iter variable value
## <dbl> <chr> <dbl>
## 1 39 test_auc_mean 0.802
You can save some time with ‘early stopping’. As before, we track the validation AUC while training the model; when the validation AUC stops improving, we stop training. This speeds up the hyperparameter search a bit. It can be done like this:
set.seed(42)
# Train a model with early stopping
model <- xgb.cv(list(objective="binary:logistic", eval_metric="auc"), data=X[train,], label=y[train],
nrounds=300, verbose=1, nfold=5, stratified=T, print_every_n=10,
early_stopping_rounds=10, max_depth=6)
## [1] train-auc:0.769101+0.000843 test-auc:0.768030+0.002260
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [11] train-auc:0.799979+0.000469 test-auc:0.797023+0.001744
## [21] train-auc:0.806135+0.000378 test-auc:0.800846+0.001649
## [31] train-auc:0.809188+0.000394 test-auc:0.801637+0.001488
## [41] train-auc:0.811520+0.000480 test-auc:0.801771+0.001427
## Stopping. Best iteration:
## [39] train-auc:0.811101+0.000424 test-auc:0.801792+0.001439
How much should you tune the hyperparameters for optimal results?
Let’s plot the first learned tree. What do we learn?
model <- xgboost(objective="binary:logistic", eval_metric="logloss", data=as.matrix(X[train,]), label=y[train], nrounds=39, verbose=0)
xgb.plot.tree(model=model, trees=0:0, width=2000, height=8000)
SHAP values are one way to interpret the results. Every individual’s predicted outcome is decomposed into contributions of their feature values, phi_i1 + phi_i2 + … + phi_id = y_i − baseline, where each feature value x_ij gets a Shapley value phi_ij indicating its contribution to the prediction. The math is elegant but a bit complex…
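The core idea can be seen in a tiny hand-computed example. With a hypothetical two-feature model (the function and numbers below are purely illustrative), each feature’s Shapley value is its contribution averaged over every order in which the features can be revealed, and the values sum exactly to the prediction minus the baseline:

```r
# Toy two-feature model (illustrative only)
f <- function(x1, x2) 0.5 + 0.3 * x1 + 0.2 * x2 + 0.1 * x1 * x2
baseline <- f(0, 0)
# Shapley value of feature 1: average its effect when it is added first
# and when it is added after feature 2 (analogously for feature 2)
phi1 <- mean(c(f(1, 0) - f(0, 0), f(1, 1) - f(0, 1)))
phi2 <- mean(c(f(0, 1) - f(0, 0), f(1, 1) - f(1, 0)))
c(phi1, phi2)                                       # 0.35 and 0.25
isTRUE(all.equal(phi1 + phi2, f(1, 1) - baseline))  # TRUE: sums to prediction minus baseline
```

With d features there are d! orderings, which is why exact Shapley values are expensive and packages like DALEX use sampling approximations.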
library(DALEX)
## Welcome to DALEX (version: 2.4.0).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## Additional features will be available after installation of: ggpubr.
## Use 'install_dependencies()' to get all suggested dependencies
##
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
##
## explain
library(DALEXtra)
explain_xgb <- explain_xgboost(model = model, data=as.matrix(X[train,]), y = y[train], label = "XGBoost")
## Preparation of a new explainer is initiated
## -> model label : XGBoost
## -> data : 533699 rows 42 cols
## -> target variable : 533699 values
## -> predict function : yhat.xgb.Booster will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package xgboost , ver. 1.5.2.1 , task classification ( default )
## -> predicted values : numerical, min = 0.01609423 , mean = 0.7697261 , max = 0.9889067
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.9832364 , mean = 3.736546e-05 , max = 0.9632571
## A new explainer has been created!
The variable importance plot displays, at the aggregate data set level, which features are important for prediction:
var_importance <- model_parts(explain_xgb)
plot(var_importance, show_boxplots=F)
On the aggregate data set level, we can use Partial Dependence Plots (PDPs) to investigate how a feature affects the predictions, for example:
pdp_fare <- model_profile(explain_xgb, variables = "age", type = "partial")
plot(pdp_fare)
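Under the hood, a partial dependence profile simply forces one feature to each value on a grid for every observation, predicts, and averages. A hand-rolled sketch of the same idea, assuming the X matrix and xgboost model from above (the subsample size and age grid are illustrative choices):

```r
# Manual partial dependence for age: fix age for all rows, predict, average
set.seed(1)
idx <- sample(which(train), 5000)  # subsample for speed
ages <- seq(20, 90, by=10)
pdp <- sapply(ages, function(a) {
  Xa <- X[idx, ]
  Xa[, "age"] <- a                 # force the age column to the grid value
  mean(predict(model, Xa))
})
plot(ages, pdp, type="b", xlab="age", ylab="average prediction")
```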
How to interpret predictions for a given person? The following person seems to have a very high probability of drinking alcohol:
person <- as.matrix(X[2,,drop=F])
predict(explain_xgb, newdata = person)
## [1] 0.9707956
The person is a 33-year-old Dutch man who is a single, well-to-do entrepreneur. The Shapley values indicate that these features all contribute quite positively to the predicted probability of drinking:
shap_xgb = predict_parts(explainer = explain_xgb, new_observation = person, type = "shap", N=10)
plot(shap_xgb, show_boxplots=F)
In this particular problem, we know the true ‘white box’ statistical model that defines the full probability distribution for the data. We can plot the terms of this statistical model to have an understandable interpretation:
plot.smap(bu_code.geometry, "drinker_interpretation_smap.txt", ncol=3)
Does this look like what the ML interpretation was telling us?
Start with a simple model. Machine learning can help if you have a complex target and lots of data.
Do a train/test set split or cross-validation to test how well your model performs.
Hyperparameter searches: tune the most important hyperparameters (e.g. nrounds and max_depth) with cross-validation and early stopping.
Interpreting the models: variable importance, partial dependence plots, and SHAP values help explain what the model has learned.
Conclusion: use machine learning when the goal is to predict. It is often easier and more accurate than statistical models. But start with the simplest model.
Exercise #1 : Try different machine learning models. For example, random forest. Does it do better?
library(ranger)
print("Random Forest")
model <- ranger(formula=formula, data=data[train,], num.trees=500, classification=T, probability=T, seed=42)
predictions <- predict(model, data[test,])$predictions[,2] # column 2 = probability of y = 1
print(calculate_auc(true_values, predictions))
Exercise #2 : Try optimizing more hyperparameters. For example, the ‘depth of trees’ (max_depth, 6 is the default) can also be important. You can loop over different tree depths and use early stopping. This optimizes both depth of trees & number of trees.
# Try different tree depths; save the validation AUC at each tree depth (max_depth) & number of trees (iter):
evaluation <- data.frame()
for (max_depth in c(1,2,3,4,5,6,7)) {
print(sprintf("===== Hyperparameters max_depth: %d ===== ", max_depth))
model <- xgb.cv(list(objective="binary:logistic", eval_metric="auc"), data=X[train,], label=y[train],
max_depth=max_depth, nrounds=3000, early_stopping_rounds=100, print_every_n=100,
verbose=1, nfold=5, stratified=T)
evaluation.eta <- data.frame(max_depth = max_depth,
iter = model$evaluation_log$iter,
train_auc_mean = model$evaluation_log$train_auc_mean,
test_auc_mean = model$evaluation_log$test_auc_mean)
evaluation <- rbind(evaluation, evaluation.eta)
}
write.csv(evaluation, "amsterdam_hyperparameters.txt", row.names=F)
# Load the results from the previous exercise
evaluation <- read.csv("amsterdam_hyperparameters.txt")
# Calculate max train and test AUC at different tree depths
evaluation.max <- evaluation %>%
gather(variable, value, c(train_auc_mean, test_auc_mean)) %>%
group_by(max_depth, variable) %>%
summarize(value=max(value))
## `summarise()` has grouped output by 'max_depth'. You can override using the
## `.groups` argument.
# Plot the results. It looks like overfitting occurs after about tree depth 2
ggplot(evaluation.max, aes(x=max_depth, y=value, group=variable, color=variable)) + geom_line() +
labs(title="XGBoost hyperparameters: depth of trees", x="max_depth", y="AUC")
# What are the optimal parameters?
opt.params <- evaluation %>%
gather(variable, value, c(train_auc_mean, test_auc_mean)) %>%
filter(variable=="test_auc_mean") %>% top_n(n=1)
## Selecting by value
opt.params
# We get very close to the 'true model' with XGBoost
model <- xgboost(objective="binary:logistic", eval_metric="logloss", data=X[train,], label=y[train],
max_depth=opt.params$max_depth, nrounds=opt.params$iter, verbose=0)
predictions <- predict(model, X[test,])
print(calculate_auc(true_values, predictions))
EXERCISE #3: In the original survey we had the answers for ~4% of the relevant population. What if only 4% of Amsterdam residents answer the survey and we predict the missing 96%? How do you expect the models to perform?
set.seed(42)
train.idx <- sample.int(n=nrow(data), size=round(0.04*nrow(data)), replace=F)
train <- (1:nrow(data) %in% train.idx)
test <- !(1:nrow(data) %in% train.idx)
# How many rows?
print(paste(sum(train), sum(test), sep=" / "))
## [1] "28464 / 683135"
With this data set, train the models and calculate the AUC in the test set.
true_values <- data[test,]$y
print("SMAP model")
ggd.bu_codes <- list(GM0363=unique(data$bu_code))
model <- smapmodel(ggd.bu_codes, bu_code.geometry)
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
formula = y ~
s(age, by = sex, bs = "ps", k = 10) +
s(age, by = ethnicity, bs = "ps", k = 10) +
s(age, by = marital_status, bs = "ps", k = 10) +
s(age, by = education, bs = "ps", k = 10) +
s(sex, ethnicity, bs = "re") +
s(sex, marital_status, bs = "re") +
s(sex, education, bs = "re") +
s(hhtype, bs = "re") +
s(hhsize, bs = "ps", k = 5) +
s(hhincomesource, bs = "re") +
s(hhhomeownership, bs = "re") +
s(hhincome, bs = "ps", k = 10) +
s(hhassets, bs = "ps", k = 10) +
s(oad, bs = "ps", k = 10)
model <- fit.smapmodel(model, data[train,], formula)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
predictions <- predict.smapmodel(model, data[test,])
print(calculate_auc(true_values, predictions))
formula <- y ~ age + sex + ethnicity + marital_status + education +
hhtype + hhsize + hhhomeownership + hhincomesource + hhincome + hhassets +
oad + X + Y
print("XGBoost")
model <- xgboost(objective="binary:logistic", eval_metric="logloss", data=X[train,], label=y[train], nrounds=50, verbose=0)
predictions <- predict(model, X[test,])
print(calculate_auc(true_values, predictions))
print("Linear model")
model <- glm(formula, data = data[train,], family = "binomial")
predictions <- predict(model, newdata = data[test,], type = "response")
print(calculate_auc(true_values, predictions))